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ABSTRACT 

Previous divide-and-conquer segmentation analyses of DNA 
sequences do not provide a satisfactory stopping criterion 
for the recursion. This paper proposes that segmentation 
be considered as a model selection process. Using the tools 
in model selection, a limit for the stopping criterion on the 
relaxed end can be determined. The Bayesian information 
criterion, in particular, provides a much more stringent stop- 
ping criterion than what is currently used. Such a stringent 
criterion can be used to delineate larger DNA domains. A 
relationship between the stopping criterion and the average 
domain size is empirically determined, which may aid in the 
determination of isochore borders. 

1. INTRODUCTION 

A typical DNA sequence is not homogeneous. There are 
local regions with contrast: C+G rich versus C+G poor; 
protein-coding regions with a strong signal of periodicity- 
three versus non-coding regions lacking this periodicity; high 
densities of 5'-CG-3' dinucleotides (CpG island) versus low 
density of this dinucleotide; etc. Finding the exact border 
between these regions is an important task in DNA sequence 
analysis. It is a common practice to use a moving window 
to visually monitor the variation of the quantity of interest 
(e.g. C-l-G density) along the sequence, and the border is 
determined in an ad hoc way. With the sequence informa- 
tion, it is actually possible to determine the border exactly 
by certain mathematical criterion. 

These mathematical approaches to delineate regional homo- 
geneous domains are known as "segmentation" j2^, "parti- 
tioning" , or "change-point analysis" |l^ in different 
fields ranging from image processing to statistics. There are 
segmentation methods that require guessing the number of 
homogeneous regions. There are also segmentation methods 
that require specification of the number of types of domains 
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(e.g. C-l-G rich and C-l-G poor represent two types of do- 
mains, whereas C-l-G high, intermediate, and low specify 
three types). Segmentation analysis of DNA sequences can 
be found in ||, |l5[ §, H . 

One particularly attractive segmentation method is a divide- 
and-conquer approach ^ (similar recursion processes are 
also discussed in statistics and machine learning under the 
names of "classification and regression tree" [jl3, "recursive 
partitioning" [Q, "decision tree induction" pc|, |3l| ], etc). 
The DNA sequence is first segmented into two subsequences 
so that base compositions on two sides of the partition are 
maximized. Then, the same procedure is carried out on 
both the left and the right subsequences; and then on the 
sub-subsequences, etc. Eventually, either the size of a sub- 
sequence is too small to segment, or the difference between 
the left and right subsequences is not big enough to be worth 
further segmentation. Recursive segmentation offers the fol- 
lowing advantages: there is no need to specify the number 
of homogeneous domains beforehand; the number of types 
of domains need not to be specified (it is implied in the 
stopping criterion); there is no constraint on the size distri- 
bution of the domains (such a constraint exists in hidden- 
Markov-model-based segmentations); and the computation 
is efficient. 

This paper addresses one of the disadvantages of this seg- 
mentation: the stopping criterion of the recursion. Another 
disadvantage of this approach - the fact that the solution is 
a local maximum with no guarantee of the global maximum 
being obtained - is not addressed here. In principle, one can 
set any stopping criterion, leading to domains of any sizes. 
In the hypothesis testing framework of statistics, whether a 
test is "significant" or not (corresponding to a continuation, 
or a termination, of the recursion in our case) is decided by 
a pre-set "significance level". Usually, the significance level 
can be 0.05, 0.01, or 0.001. These levels are arbitrary and 
will not guarantee objectivity j^. 

We provide a stopping criterion based on the framework of 
model selection (for a detailed discussion of the hypothesis 
testing framework versus the model selection framework, see 
|l^). This new stopping criterion offers at a minimum con- 
dition for the recursive segmentation to continue. On the 
other hand, in the hypothesis testing framework, no such 
minimum condition exists; for example, the 0.06 significance 
level is weaker than the 0.05 level, and 0.1 is even weaker 
than 0.06, etc. In the model selection framework, there are 



two different guiding principles. The first is to clioose a 
model that most closely approximates the true model. The 
second is to find the true model among a list of candidate 
models. The first principle leads to a technique of Akaike 
Information Criterion (AIC), and the second leads to the 
technique of Bayesian Information Criterion (BIC). We will 
show that BIC-based stopping criterion for segmentation is 
practically more useful. 



2. METHODS 

2.1 The divide-and-conquer segmentation in 
its original formulation 

The original publication of this divide-and-conquer segmen- 
tation method is called "entropic segmentation" be- 
cause the quantities used in determining the partition point 
are based on entropy, a statistical physics concept. The 
entropy of a sequence with length A'^ and number of bases 
(b) {Nc,} {a = a,c,g,t) is calculated as ("means that the 
quantity is estimated from the data) 
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Given a partition point i (1 < i < A), an entropy-based 
quantity called Jensen-Shannon distance (divergence) 
is defined as 

Djs = E{{N.}) - ^E{{N.A) - ^MiNcr}) (2) 

where {Nc,,i} and {Na,r} are the base counts of the left 
(from position 1 to i) and the right (from position i + 1 to 
position A) subsequences (with J2a ^"-i- ~ *' J2a ^a,r = 
N — i). The partition point i is chosen to maximize Djs- 



2.2 The divide-and-conquer segmentation as 
a likelihood ratio test 

In fact, the above entropic description can be cast into a hy- 
pothesis testing framework - the likelihood ratio test WA- 
Likelihood is simply the probability of observing the data, 
given a model, with emphasis on the functional dependence 
on the model parameter (in other words, the normalization 
coefficient is not needed). To test whether a model is "sig- 
nificant", the likelihood under the model (L2) is calculated, 
and maximized over all possible parameters (L2). A similar 
calculation is carried out on the null model (Li and Li). 
If the null model is the correct model of the data, and if 
the null model is nested in the alternative model, it can be 
shown that in the large sample size limit [ha: 
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where Xdf is the chi-squared distribution with degrees of 
freedom df (i.e. sum of df terms of squared unit normal 
distribution), K2 and K\ are the number of free parameters 
in maximizing L2 and L\. 

In our divide-and-conquer segmentation, Li is the likelihood 
assuming the sequence being a random sequence, and L2 is 



the likelihood assuming two random subsequences: 
-Li({Pa}) = n^""' 



(4) 



where {pa} {a = a,c,g,t) is the base composition of the 
whole sequence (here these are free parameters in the model 
to be estimated), {pa,i} and {pa,r} are the similar base com- 
positions of the left and right subsequences. The maximum 
likelihood estimation of a base composition is simply the 
percentage of the base: Pa = Na/N. It can easily be shown 
that 2 log{L2/Li) is the same as 2NDjs- The number of pa- 
rameters in the two models are K2 ~ 7 (the partition point 
i is also a free parameter) and Ki — 3. So 2NDjs under 
the null hypothesis should obey the Xd/=4 distribution (the 
same conclusion was reached before, see and (I Crosse, 
et al. in preparation), only the df used there is 3, instead 
4). 

2.3 The divide-and-conquer segmentation as 
a model selection 

There are many shortcomings in the hypothesis testing frame- 
work ^1 . The purpose of a test is to see how bad a descrip- 
tion of the data Li is, not how good a description L2 is. 
For many circumstances, it is not really what we are in- 
terested in. In the model selection framework, we directly 
address the "merit" of a model. One measure of such a 
"merit" is whether the model is close (a better approxima- 
tion) to the true model. The closeness is measured by the 
KuUback-Leibler distance (divergence) |2^], and the Akaike 
Information Criterion (AIC) is one approximation of this 
distance (with the constant term removed, and multiplied 
by a factor of 2) pi: 



^/C= -21og(L) +2A' + 0( — ) 
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where L is the maximized likelihood of the model, K is 
the number of free parameters in the model. A model with 
the lowest AIC is closest to the true model, thus the best 
approximating model. 

Another "merit" of a model is how the data increases the 
probability of the model (only in Bayesian statistics is it pos- 
sible to extend the concept of probability to the model and 
its parameters). The factor between the prior and posterior 
probability of the model is the "integrated likelihood" [ ^ . 
An asymptotic approximation of minus-twice the logarithm 
of the integrated likelihood is the Bayesian Information Cri- 
terion (BIC) Q: 

BIC = -2 log(L) + log{N)K + 0(1) + 0(-^) + 0(1), 

(6) 

where A is the sample size. A model with the lowest BIC has 
the largest integrated likelihood, and this translates to the 
largest posterior probability if all models have the same prior 
probability. Note that AIC emphasizes an approximation of 
the true model, and BIC emphasizes the selection of the true 
model from the space of all models. The high-order terms in 
AIC are discussed in p7| , and the derivation of BIC can 
be found in [^. It can easily be seen that if log(A) > 2 (or 



lambda phage (N=48.5k) 




Figure 1: Lambda (A) bacteriophage sequence. 



A'" > 7.389), the penalty on the number of model parameters 
(i.e. the second term in Eq.(^) and Eq.(^)) in BIC is more 
severe than that in AIC. As a result, BIC tends to select 
simpler models than AIC. 

When the segmentation is viewed as a model selection pro- 
cess, the model before the segmentation describes the se- 
quence as a random sequence, whereas that after the seg- 
mentation describes it as two random subsequences. Since 
AIC/BIC must decrease for the segmentation to continue, 
it can be shown that they lead to the two stopping criteria 
as follows: 



stopping threshold: 



AlC-based 



stopping criterion 



2ND.JS > 8-hO( — ) 



BIC-based 



stopping criterion 



2Nbjs > 41og(iV) + 0(l) + 0(-^) + 0(l).(7) 



It is interesting to compare these criteria with those in the 
hypothesis testing framework. Setting the value of 8 to 
the Xd/=4i the corresponding significance level (p-value, tail- 
area) is 0.091578. In the hypothesis testing framework, it is 
allowed to set an even more relaxed significance level such 
as 0.1, but in the AlC-based model selection, 0.091578 is 
the limit of allowed levels. Similarly, with a given sequence 
length A^, the limit of allowed levels can be determined by 
a BIC-based model selection; for example, if A'^ = 1Mb, the 
significance level is 2.8631 xlO"". A gain, the significance 
level can not be more relaxed than these limits. 

Besides limits on the relaxed side of the stopping crite- 
rion, there are no theoretical limits on the stringent side. 
One model can be "marginally better" than another model, 
"moderately better", or "much better", etc. We will show 
that one can gradually make the stopping criterion more 
stringent so that average domain size is increased. For con- 
venience, we define the "strength" of a l-to-2 segmentation 
as the percentage increase of 2NDjs over the BIC-defined 



f. , 2iVA.s- 4 log(jV) 
strengtn- 4iog(iV) 



(8) 



The strength has to be larger than 0, but it has no upper 
limit. 



3. RESULTS 

Since the AlC-based stopping criterion is more relaxed than 
the typical 0.01-significance-level test, one will end up with 
more domains than from the program discussed in |E8| . The 
BIC-based stopping criterion, however, is more interesting 
for our purpose, because it provides a theoretical justifi- 
cation for using a much more stringent stopping criterion 
than those typically used in the hypothesis testing frame- 
work. We illustrate the BIC-based segmentation by three 
DNA sequences with a wide range of sequence lengths. 

3.1 Lambda phage 

Fig.l shows the result for A bacteriophage (A'^ — 48,502 b) 
pa]. This sequence has been tested with various segmenta- 
tion methods in There are several pieces of information 
displayed in Fig.l: the domain borders obtained by the BIC- 
based segmentation on the original four-symbol sequence 
(upper bars); the borders segmented by the two-symbol (CG 
vs. AT) sequence (middle bars); the borders obtained by 
the AlC-based segmentations (with higher-order terms in- 
cluded) (dots; due to the limitation of resolution, individual 
dots can be hard to see); a moving- window C-fG content 
along the sequence; the strength of the segmentations as de- 
fined in Eq.(y) (lower spikes); and the sequential order of 
early-rounds of segmentations (e.g. the first partition point 
from the l-to-2 segmentation is labeled "1st"). We note the 
following: (1) Segmentation results from the four-symbol se- 
quence and the two-symbol sequence are different. (2) The 
number of domains by segmenting the four-symbol sequence 
is 6, which is the same as results from a two- state hidden 
Markov segmentation as discussed in §. (3) Early-rounds 
of l-to-2 segmentations are usually the "strongest" (with 
largest strengths). (4) Even without any tuning of param- 




Figure 2: Human major histocompatibility complex (MHC) sequence. 



drosophila ch2 left arm (N=22M) 





Figure 3: Left arm of Drosophila melanogaster chromosome 2 



eters (in contrast to tuning of the significance level in the 
hypothesis testing framework), the BIC-based segmentation 
manages to obtain a reasonable number of domains (AIC- 
based segmentation, on the other hand, leads to too many 
domains) . 

3.2 MHC 

Fig. 2 shows the result for the human major histocompat- 
ibility complex (MHC) sequence (iV = 3,673,778 b) 
The MHC sequence is a highly gene-rich region (with more 
than 200 genes) that is located on the short-arm of chromo- 
some 6 of the human genome. The segmentation result cap- 
tures the complexity of this sequence. With so many domain 
borders in Fig. 2, we only show those that have strengths 
larger than 100%. Historically, the MHC sequence is di- 
vided into three domains (in the telomere-to-centromere di- 
rection): class-I, class-HI (C-|-G-rich), class-II (C-|-G-poor). 
The MHC sequencing project added another C-|-G-rich do- 
main to the end: extended-class-II. Interestingly, the bor- 
der of these domains can be easily detected by segmen- 
tation (these results are from the two-symbol segmenta- 
tion): I/HI: i= 1,841,871, strength^ 23679.6%, III/II: i= 
2,483,966, strength^ 17084.7%, Il/extended-II: 1=3,384,907, 
strength=28849%. These three l-to-2 segmentations are the 
strongest. With this segmentation result, the domain sizes 
of class I, HI, II, and extended-II are: 1.84 Mb, 0.64 Mb, 0.90 
Mb and 0.29 Mb. The number of segmented domains in the 
MHC sequence is very large (1260 from the BIC-based two- 
symbol segmentation and 1828 from the BIC-based four- 
symbol segmentation) . Segmentation with the minimum re- 
quirement (i.e. for BIC to decrease) not only leads to large, 
lOOkb-pIus domains, but also leads to smaller-scaled base 
composition fluctuation. This "domains-within-domains" 
phenomenon has been discussed in |2^, ^, If one is 

only interested in isochores, i.e., large DNA segments with 
usually 300 kb or longer that have relatively homogeneous 
base composition [^, a more stringent criterion has to be 
used (to be discussed later). 



3.3 Left-arm of Drosophila chromosome 2 

The last sequence to be segmented is the left arm of Drosophila 
melanogaster chromosome 2 (TV = 22,075,671 b)|^]. There 
is 1.78% of the sequence that is not determined (symbol 
"n" or "N"). To preserve the location information, these 
undetermined symbols are replaced randomly by the four 
nucleotides (according to the actual base composition of 
this sequence). Only the l-to-2 segmentations with strength 
larger than 200% are included in Fig. 3, and only the result 
for the four-symbol sequence is displayed. The segmenta- 
tion of a four-symbol sequence is more likely to cut the 
telomere (as well as centromere) at an earlier stage than 
the corresponding two-symbol sequence; and this is shown 
in Fig. 3. This observation can be used to delineate complex 
sequence patterns in telomere sequences (D Kessler and W 
Li, in preparation). 

Although the drosophila sequence is much longer than the 
MHC sequence, there is only one l-to-2 segmentation of the 
drosophila's left-arm of chromosome 2 that has a similar 
strength as those of the MHC sequence leading to domain 
borders. This occurs at position 6,959,803 with the strength 
16768%. If we use a similar strength criterion as that used 
in delineating three domain classes in the MHC sequence, 
there is only one domain border in this sequence. 

3.4 How stringent the stopping criterion has 
to be to reach a certain domain size 

Since the model selection framework only provides a limit on 
the relaxed end of the stopping criterion, the stringent end 
is in principle open. Nevertheless, we can empirically deter- 
mine the typical domain size as a function of the stringency 
of the stopping criterion. Fig. 4 shows the average domain 
sizes versus the threshold value for the strength, all based 
on the four-symbol segmentation. Besides the three se- 
quences used in Figs. 1-3, results from Escherichia coli {N = 
4, 639, 221 b), the right arm of Drosophila melanogaster chro- 
mosome 2 (A^ = 20,228,487 b), and yeast Saccharornyces 
cerevisiae chromosome 3 (A = 315, 341 b) are also included. 
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Figure 4: Average domain size vs. threshold for segmentation strength. 



Fig. 4 shows that for the drosophila sequence (not surpris- 
ingly, the left and right arms of chromosome 2 behave simi- 
larly), the segmented domains have an average size of 1 Mb 
when the threshold of segmentation strength is set at 900%. 
When the threshold of strength is set at 500%, the average 
size of the segmented domains is around 300 kb - a rough 
minimum size for an isochore Q. Fig. 4 also shows that dif- 
ferent sequences behave differently. For the MHC sequence, 
for example, it is very difficult to delineate only the larger 
domains. To reach the average size of 300 kB, the threshold 
of strength has to be set at 1300%, and to reach 1 Mb, the 
threshold has to be around 2800% ~ 3200%. Another way 
to state this difference is that the MHC is more "complex" 
than other sequences in Fig. 4, in the sense of the existence 
of a huge number of domains. 

Plots like FiE.4 are similar to the "compositional complex- 
ity" |34[ H, |. The difference is that in ||^, ||, |||, not only 
the number of domains, but also the base composition differ- 
ence between domains is part of the measure of complexity. 
In Fig. 4, it is purely the number of domains. Nevertheless, 
the plot of Fig. 4 is useful because it provides practical guid- 
ance on the choice of stopping criterion at the stringent end. 
This choice will subjectively depend on what length scales 
are of interest to the investigator. 
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